432 Class 04

Thomas E. Love, Ph.D.

2024-01-25

Today’s Agenda

  • Fitting two-factor ANOVA/ANCOVA models with lm
    • Incorporating an interaction between factors
    • Incorporating polynomial terms
    • Incorporating restricted cubic splines
  • Regression Diagnostics via Residual Plots
  • Validating / evaluating results with yardstick

Appendix

How the class4im data were created from smart_ohio.csv

Today’s R Setup

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(broom)
library(gt)
library(mosaic)
library(patchwork)       
library(naniar)
library(simputation)    ## single imputation of missing data
library(rsample)        ## data splitting
library(yardstick)      ## evaluating fits
library(rms)            ## regression tools (Frank Harrell)
library(tidyverse)      

theme_set(theme_bw()) 

The class4im data

  • 894 subjects in Cleveland-Elyria with bmi and no history of diabetes (missing values singly imputed: assume MAR)
  • All subjects have hx_diabetes = 0, and all are located in the MMSA labeled Cleveland-Elyria.
  • See Course Notes Chapter on BRFSS SMART data for variable details
  • Appendix provides details on data development.

The Five Variables We’ll Use Today

There are 9 variables in the data, but we’ll use only these 5 today.

Variable Description
ID subject identifying code
bmi (outcome) Body-Mass index in kg/m2.
exerany any exercise in the past month: 1 = yes, 0 = no
health self-reported overall health (5 levels: E, VG, G, F, P)
fruit_day average fruit servings consumed per day

Data Load

class4im <- read_rds("c04/data/class4im.Rds")
class4im |> n_miss()
[1] 0
identical(nrow(class4im), n_distinct(class4im$ID))
[1] TRUE

Splitting the Sample

set.seed(432)    ## for future replication
class4im_split <- initial_split(class4im, prop = 3/4)
train_c4im <- training(class4im_split)
test_c4im <- testing(class4im_split)
c(nrow(class4im), nrow(train_c4im), nrow(test_c4im))
[1] 894 670 224
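
Under the hood, initial_split() samples row indices without replacement. Here is a base-R sketch of the same idea; the toy data frame dat is an invented stand-in for class4im, and the floor() choice is an assumption that happens to reproduce the 670/224 sizes above.

```r
# Base-R sketch of a 75/25 train/test split, mimicking rsample::initial_split().
# `dat` is a toy stand-in for class4im (894 rows).
set.seed(432)
dat <- data.frame(ID = as.character(1:894), y = rnorm(894))

train_idx <- sample(nrow(dat), size = floor(0.75 * nrow(dat)))
train_dat <- dat[train_idx, ]       # sampled rows become the training set
test_dat  <- dat[-train_idx, ]      # everything else is held out

c(nrow(dat), nrow(train_dat), nrow(test_dat))   # 894 670 224
```

In practice, stick with initial_split(): it stores the split so you can recover either piece later with training() and testing().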

Models We’ll Build Today

  1. Predict bmi using exerany and health (both treated as categorical)
    • without and then with an interaction between the two predictors
  2. Add in a quantitative covariate, fruit_day, first simply as a main (and linear) effect

More Models We’ll Build

  3. Incorporate the fruit_day information using a quadratic polynomial instead.
  4. Incorporate the fruit_day information using a restricted cubic spline with 4 knots instead.

We’ll fit all of these models with lm, and assess them in terms of in-sample (training) fit and out-of-sample (testing) performance.

We might consider transforming bmi, but we won’t do so today.

bmi means by exerany and health

summaries_1 <- train_c4im |>
    group_by(exerany, health) |>
    summarise(n = n(), mean = mean(bmi), stdev = sd(bmi))
summaries_1 
# A tibble: 10 × 5
# Groups:   exerany [2]
   exerany health     n  mean stdev
     <int> <fct>  <int> <dbl> <dbl>
 1       0 E         18  27.5  3.56
 2       0 VG        54  26.9  5.27
 3       0 G         58  30.3  7.45
 4       0 F         31  35.1  9.95
 5       0 P          8  36.2 12.1 
 6       1 E         92  25.8  4.49
 7       1 VG       191  26.8  4.89
 8       1 G        152  29.1  6.27
 9       1 F         49  27.2  5.52
10       1 P         17  28.5  8.61

Code for Interaction Plot

ggplot(summaries_1, aes(x = health, y = mean, 
                        col = factor(exerany))) +
  geom_point(size = 2) +
  geom_line(aes(group = factor(exerany))) +
  scale_color_viridis_d(option = "C", end = 0.5) +
  labs(title = "Observed Means of BMI",
       subtitle = "by Exercise and Overall Health")
  • Note the use of factor here, since the exerany variable is in fact numeric, although it takes only the values 1 and 0.
    • Sometimes it’s helpful to treat 1/0 as a factor, and sometimes not.
  • Where (if anywhere) do you see evidence of serious non-parallelism in the plot this code produces, shown on the next slide?

Resulting Interaction Plot

Fitting a Two-Way ANOVA model for BMI

Model m_1 without interaction

m_1 <- lm(bmi ~ exerany + health, data = train_c4im)
  • How well does this model fit the training data?
glance(m_1) |> 
    select(r.squared, adj.r.squared, sigma, nobs, 
           df, df.residual, AIC, BIC) |> 
  gt() |> tab_options(table.font.size = 20)
r.squared adj.r.squared sigma nobs df df.residual AIC BIC
0.08890746 0.08204682 6.115123 670 5 664 4335.776 4367.327

Tidied ANOVA for m_1

tidy(anova(m_1)) |> gt() |> tab_options(table.font.size = 20)
term df sumsq meansq statistic p.value
exerany 1 896.5102 896.51016 23.97424 1.227321e-06
health 4 1526.4943 381.62357 10.20528 5.053239e-08
Residuals 664 24830.1020 37.39473 NA NA

Tidied summary of m_1 coefficients

tidy(m_1, conf.int = TRUE, conf.level = 0.90) |> 
  gt() |> tab_options(table.font.size = 20)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 27.9013896 0.7427324 37.565871 1.863756e-166 26.6779966 29.124783
exerany -2.1979658 0.5501291 -3.995364 7.182231e-05 -3.1041118 -1.291820
healthVG 0.6253919 0.7025376 0.890190 3.736863e-01 -0.5317941 1.782578
healthG 3.1411380 0.7223962 4.348221 1.587998e-05 1.9512419 4.331034
healthF 3.7154895 0.9069472 4.096699 4.708754e-05 2.2216099 5.209369
healthP 4.5472272 1.3576232 3.349403 8.556673e-04 2.3110158 6.783439

Interpreting m_1

Name exerany health predicted bmi
Harry 0 Excellent 27.90
Sally 1 Excellent 27.90 - 2.20 = 25.70
Billy 0 Fair 27.90 + 3.72 = 31.62
Meg 1 Fair 27.90 - 2.20 + 3.72 = 29.42
  • Effect of exerany?
  • Effect of health = Fair instead of Excellent?
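
The table’s arithmetic comes straight from the tidied m_1 coefficients; a quick base-R check, with values copied from the tidy(m_1) output above:

```r
# Reproducing the m_1 predictions "by hand" from the tidied coefficients.
b0     <- 27.9013896   # (Intercept): exerany = 0, health = Excellent
b_exer <- -2.1979658   # exerany = 1 vs. exerany = 0
b_fair <-  3.7154895   # health = Fair vs. health = Excellent

harry <- b0                     # no exercise, Excellent health
sally <- b0 + b_exer            # exercises, Excellent health
billy <- b0 + b_fair            # no exercise, Fair health
meg   <- b0 + b_exer + b_fair   # exercises, Fair health

round(c(harry, sally, billy, meg), 2)   # 27.90 25.70 31.62 29.42
```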

m_1 Residual Plots

par(mfrow = c(1,2)); plot(m_1, which = c(1,2))

m_1 Residual Plots

par(mfrow = c(1,2)); plot(m_1, which = c(3,5))

Fitting ANOVA model m_1int including interaction

Adding the interaction term to m_1

m_1int <- lm(bmi ~ exerany * health, data = train_c4im)
  • How do our models compare on fit to the training data?
bind_rows(glance(m_1), glance(m_1int)) |> 
    mutate(mod = c("m_1", "m_1int")) |>
    select(mod, r.sq = r.squared, adj.r.sq = adj.r.squared, 
       sigma, nobs, df, df.res = df.residual, AIC, BIC) |> 
  gt() |> tab_options(table.font.size = 20)
mod r.sq adj.r.sq sigma nobs df df.res AIC BIC
m_1 0.08890746 0.08204682 6.115123 670 5 664 4335.776 4367.327
m_1int 0.12632022 0.11440640 6.006371 670 9 660 4315.682 4365.262

ANOVA for the m_1int model

tidy(anova(m_1int)) |> gt() |> tab_options(table.font.size = 20)
term df sumsq meansq statistic p.value
exerany 1 896.5102 896.5102 24.850255 7.927331e-07
health 4 1526.4943 381.6236 10.578177 2.595323e-08
exerany:health 4 1019.6139 254.9035 7.065638 1.424050e-05
Residuals 660 23810.4882 36.0765 NA NA

ANOVA test comparing m_1 to m_1int

anova(m_1, m_1int)
Analysis of Variance Table

Model 1: bmi ~ exerany + health
Model 2: bmi ~ exerany * health
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    664 24830                                  
2    660 23811  4    1019.6 7.0656 1.424e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m_1int coefficients

tidy(m_1int, conf.int = TRUE, conf.level = 0.90) |>
  rename(se = std.error, t = statistic, p = p.value) |> 
  gt() |> tab_options(table.font.size = 20)
term estimate se t p conf.low conf.high
(Intercept) 27.4872222 1.415715 19.4157837 9.054815e-67 25.1553046 29.8191398
exerany -1.7027657 1.548026 -1.0999591 2.717510e-01 -4.2526216 0.8470902
healthVG -0.6140741 1.634727 -0.3756431 7.073029e-01 -3.3067406 2.0785924
healthG 2.8362261 1.620573 1.7501373 8.055931e-02 0.1668731 5.5055790
healthF 7.6366487 1.779890 4.2905176 2.049454e-05 4.7048754 10.5684221
healthP 8.6790278 2.552217 3.4005837 7.129569e-04 4.4751035 12.8829520
exerany:healthVG 1.6259526 1.803705 0.9014516 3.676771e-01 -1.3450480 4.5969533
exerany:healthG 0.4982648 1.804367 0.2761438 7.825240e-01 -2.4738262 3.4703558
exerany:healthF -6.2209012 2.072776 -3.0012420 2.790272e-03 -9.6351059 -2.8066964
exerany:healthP -5.9623078 3.004679 -1.9843412 4.763155e-02 -10.9115115 -1.0131042

Interpreting the m_1int model

Name exerany health predicted bmi
Harry 0 Excellent 27.49
Sally 1 Excellent 27.49 - 1.70 = 25.79
Billy 0 Fair 27.49 + 7.64 = 35.13
Meg 1 Fair 27.49 - 1.70 + 7.64 - 6.22 = 27.21
  • How do we interpret effect sizes here? It depends.

Interpreting the m_1int model

  • Effect of exerany?
    • If health = Excellent, effect is -1.70
    • If health = Fair, effect is (-1.70 - 6.22) = -7.92
  • Effect of health = Fair instead of Excellent?
    • If exerany = 0 (no), effect is 7.64
    • If exerany = 1 (yes), effect is (7.64 - 6.22) = 1.42
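
A quick check of these conditional effects, using coefficients copied from the tidied m_1int summary shown earlier:

```r
# Conditional effects in the interaction model m_1int,
# using coefficients from the tidy(m_1int) output shown earlier.
b_exer    <- -1.7027657   # exerany effect when health = Excellent
b_fair    <-  7.6366487   # healthF effect when exerany = 0
b_ex_fair <- -6.2209012   # exerany:healthF interaction

b_exer                # effect of exerany if health = Excellent
b_exer + b_ex_fair    # effect of exerany if health = Fair (about -7.92)
b_fair                # effect of Fair vs. Excellent if exerany = 0
b_fair + b_ex_fair    # effect of Fair vs. Excellent if exerany = 1 (about 1.42)
```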

Residuals from m_1int?

par(mfrow = c(1,2)); plot(m_1int, which = c(1,2))

Residuals from m_1int?

par(mfrow = c(1,2)); plot(m_1int, which = c(3,5))

Incorporating a Covariate (as a main and linear effect) into our two-way ANOVA models

Add fruit_day to m_1

m_2 <- lm(bmi ~ fruit_day + exerany + health, data = train_c4im)
  • How well does this model fit the training data?
bind_rows(glance(m_1), glance(m_2)) |>
  mutate(mod = c("m_1", "m_2")) |>
  select(mod, r.sq = r.squared, adj.r.sq = adj.r.squared, 
         sigma, df, df.res = df.residual, AIC, BIC) |> 
  gt() |> tab_options(table.font.size = 20)
mod r.sq adj.r.sq sigma df df.res AIC BIC
m_1 0.08890746 0.08204682 6.115123 5 664 4335.776 4367.327
m_2 0.09790484 0.08974108 6.089441 6 663 4331.126 4367.184

ANOVA for the m_2 model

tidy(anova(m_2)) |> gt() |> tab_options(table.font.size = 20)
term df sumsq meansq statistic p.value
fruit_day 1 467.1274 467.12739 12.597388 4.135820e-04
exerany 1 761.3825 761.38250 20.532794 6.953235e-06
health 4 1439.7012 359.92529 9.706385 1.239354e-07
Residuals 663 24584.8954 37.08129 NA NA

m_2 coefficients

tidy(m_2, conf.int = TRUE, conf.level = 0.90) |>
  gt() |> tab_options(table.font.size = 20)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 28.6715422 0.7979500 35.9315038 9.237344e-158 27.3571948 29.9858896
fruit_day -0.5458806 0.2122799 -2.5715128 1.034232e-02 -0.8955386 -0.1962226
exerany -2.0476701 0.5509276 -3.7167681 2.188237e-04 -2.9551334 -1.1402069
healthVG 0.5607594 0.7000384 0.8010409 4.233949e-01 -0.5923125 1.7138313
healthG 3.0052279 0.7213012 4.1663980 3.504237e-05 1.8171329 4.1933229
healthF 3.5515381 0.9053858 3.9226793 9.669720e-05 2.0602272 5.0428490
healthP 4.5620682 1.3519338 3.3744760 7.826611e-04 2.3352234 6.7889129

m_2 Residuals

par(mfrow = c(1,2)); plot(m_2, which = c(1,2))

m_2 Residuals

par(mfrow = c(1,2)); plot(m_2, which = c(3,5))

Who is the poorest-fit case?

The residual plots suggest we look at row 28.

train_c4im |> slice(28) |>
  select(ID, bmi, fruit_day, exerany, health) |> 
  gt() |> tab_options(table.font.size = 20)
ID bmi fruit_day exerany health
320 63 1 0 P

What is unusual about this subject?

train_c4im |> arrange(desc(bmi)) 
# A tibble: 670 × 9
   ID      bmi inc_imp fruit_day drinks_wk female exerany health race_eth       
   <chr> <dbl>   <dbl>     <dbl>     <dbl>  <int>   <int> <fct>  <fct>          
 1 320    63     20581      1         0.7       1       0 P      Black non-Hisp…
 2 959    59.0    5720      0.1       0         1       0 F      White non-Hisp…
 3 1078   56.3   43948      1.02      0         1       0 F      White non-Hisp…
 4 633    51.5   39964      1         0         0       0 G      White non-Hisp…
 5 947    51.2   10280      2.02      0.93      1       0 F      White non-Hisp…
 6 535    50.5   22307      0.46      0         0       1 G      Hispanic       
 7 888    50.0  113725      2.02      0         1       1 G      White non-Hisp…
 8 743    49.0   31469      0.13      0         0       0 G      Black non-Hisp…
 9 551    48.8   91510      1.14      0.93      0       1 G      White non-Hisp…
10 924    48.2   11825      2         0         1       1 P      Multiracial no…
# ℹ 660 more rows

Include the interaction term?

m_2int <- lm(bmi ~ fruit_day + exerany * health, 
          data = train_c4im)

ANOVA for the m_2int model

tidy(anova(m_2int)) |> gt() |> tab_options(table.font.size = 20)
term df sumsq meansq statistic p.value
fruit_day 1 467.1274 467.12739 13.096273 3.185537e-04
exerany 1 761.3825 761.38250 21.345939 4.612955e-06
health 4 1439.7012 359.92529 10.090780 6.231466e-08
exerany:health 4 1079.2034 269.80084 7.564072 5.829277e-06
Residuals 659 23505.6920 35.66873 NA NA

m_2int coefficients

tidy(m_2int, conf.int = TRUE, conf.level = 0.90) |>
  rename(se = std.error, t = statistic, p = p.value) |> 
  gt() |> tab_options(table.font.size = 18)
term estimate se t p conf.low conf.high
(Intercept) 28.2736467 1.433168 19.7280703 1.964272e-68 25.91297599 30.6343175
fruit_day -0.6101570 0.208728 -2.9232160 3.583167e-03 -0.95396723 -0.2663467
exerany -1.4458021 1.541761 -0.9377602 3.487112e-01 -3.98534313 1.0937390
healthVG -0.6573334 1.625530 -0.4043810 6.860638e-01 -3.33485589 2.0201891
healthG 2.7490624 1.611665 1.7057285 8.852985e-02 0.09437803 5.4037469
healthF 7.5895769 1.769876 4.2881981 2.070809e-05 4.67429252 10.5048613
healthP 9.0747824 2.541361 3.5708356 3.817403e-04 4.88873094 13.2608338
exerany:healthVG 1.5952539 1.793513 0.8894576 3.740817e-01 -1.35896550 4.5494733
exerany:healthG 0.4226432 1.794327 0.2355441 8.138596e-01 -2.53291698 3.3782033
exerany:healthF -6.4107251 2.062051 -3.1089073 1.958699e-03 -9.80727160 -3.0141785
exerany:healthP -6.4994464 2.993295 -2.1713351 3.026195e-02 -11.42990959 -1.5689833

ANOVA: Compare m_2 & m_2int

anova(m_2, m_2int)
Analysis of Variance Table

Model 1: bmi ~ fruit_day + exerany + health
Model 2: bmi ~ fruit_day + exerany * health
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    663 24585                                  
2    659 23506  4    1079.2 7.5641 5.829e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m_2int Residuals

par(mfrow = c(1,2)); plot(m_2int, which = c(1,2))

m_2int Residuals

par(mfrow = c(1,2)); plot(m_2int, which = c(3,5))

Which of the four models fits best?

In the training sample, we have…

mod r.sq adj.r.sq sigma df df.res AIC BIC
m_1 0.08890746 0.08204682 6.115123 5 664 4335.776 4367.327
m_2 0.09790484 0.08974108 6.089441 6 663 4331.126 4367.184
m_1int 0.12632022 0.11440640 6.006371 9 660 4315.682 4365.262
m_2int 0.13750412 0.12441617 5.972330 10 659 4309.050 4363.137
  • Adjusted \(R^2\), \(\sigma\), AIC and BIC all improve as we move down the table from m_1 to m_2int.
  • BUT the training sample cannot judge between models accurately. Our models have already seen that data.

Predicting bmi in the test sample

  • For fairer comparisons, consider the held-out sample.

We’ll use augment from the broom package…

m1_test_aug <- augment(m_1, newdata = test_c4im)
m1int_test_aug <- augment(m_1int, newdata = test_c4im)
m2_test_aug <- augment(m_2, newdata = test_c4im)
m2int_test_aug <- augment(m_2int, newdata = test_c4im)

This adds fitted values (predictions) and residuals (errors) …

m1_test_aug |> select(ID, bmi, .fitted, .resid) |> 
    slice(1:2) |> gt() |> tab_options(table.font.size = 20)
ID bmi .fitted .resid
4 26.51 28.84456 -2.334562
5 24.25 28.84456 -4.594562

The yardstick package

For each subject in the testing set, we will need:

  • estimate = model’s prediction of that subject’s bmi
  • truth = the bmi value observed for that subject

Calculate a summary of the predictions across the \(n\) test subjects

Summaries from yardstick

  • \(R^2\) = square of the correlation between truth and estimate
  • mae = mean absolute error …

\[ mae = \frac{1}{n} \sum{|truth - estimate|} \]

  • rmse = root mean squared error …

\[ rmse = \sqrt{\frac{1}{n} \sum{(truth - estimate)^2}} \]
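
Both summaries are simple enough to compute by hand, which helps demystify what yardstick returns; a base-R sketch with invented truth and estimate vectors:

```r
# Hand-rolled versions of the yardstick mae() and rmse() point estimates.
mae_by_hand  <- function(truth, estimate) mean(abs(truth - estimate))
rmse_by_hand <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

truth    <- c(26.5, 24.3, 31.0, 28.2)   # invented observed bmi values
estimate <- c(28.8, 28.8, 29.5, 27.0)   # invented model predictions

mae_by_hand(truth, estimate)    # 2.375
rmse_by_hand(truth, estimate)   # about 2.703
```

Because rmse squares the errors before averaging, it penalizes large misses (like our row-28 subject) more heavily than mae does.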

Testing Results (using \(R^2\))

We can use the yardstick package and its rsq() function.

testing_r2 <- bind_rows(
    rsq(m1_test_aug, truth = bmi, estimate = .fitted),
    rsq(m1int_test_aug, truth = bmi, estimate = .fitted),
    rsq(m2_test_aug, truth = bmi, estimate = .fitted),
    rsq(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
    mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))
testing_r2 |> gt() |> tab_options(table.font.size = 20)
.metric .estimator .estimate model
rsq standard 0.07161065 m_1
rsq standard 0.03973444 m_1int
rsq standard 0.06517957 m_2
rsq standard 0.03635748 m_2int

Mean Absolute Error?

Consider the mean absolute prediction error …

testing_mae <- bind_rows(
    mae(m1_test_aug, truth = bmi, estimate = .fitted),
    mae(m1int_test_aug, truth = bmi, estimate = .fitted),
    mae(m2_test_aug, truth = bmi, estimate = .fitted),
    mae(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
    mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))
testing_mae |> gt() |> tab_options(table.font.size = 20)
.metric .estimator .estimate model
mae standard 4.428381 m_1
mae standard 4.623948 m_1int
mae standard 4.476413 m_2
mae standard 4.706638 m_2int

Root Mean Squared Error?

How about the square root of the mean squared prediction error, or RMSE?

testing_rmse <- bind_rows(
   rmse(m1_test_aug, truth = bmi, estimate = .fitted),
   rmse(m1int_test_aug, truth = bmi, estimate = .fitted),
   rmse(m2_test_aug, truth = bmi, estimate = .fitted),
   rmse(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
   mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))
testing_rmse |> gt() |> tab_options(table.font.size = 20)
.metric .estimator .estimate model
rmse standard 5.728520 m_1
rmse standard 6.025257 m_1int
rmse standard 5.769484 m_2
rmse standard 6.082435 m_2int

Other yardstick summaries (1)

  • rsq_trad() = \(R^2\) defined using sums of squares.
    • The rsq() measure we showed a few slides ago is a squared correlation coefficient, guaranteed to fall in [0, 1].
  • mape() = mean absolute percentage error
  • mpe() = mean percentage error

Other yardstick summaries (2)

  • huber_loss() = Huber loss (often used in robust regression), which is less sensitive to outliers than rmse().
  • ccc() = concordance correlation coefficient, which attempts to measure both consistency/correlation (like rsq()) and accuracy (like rmse()).

See the yardstick home page for more details.

Can I STOP HERE???!?!?!?!?

Incorporating Non-Linearity into our models

Incorporating a non-linear term for fruit_day

Suppose we wanted to include a polynomial term for fruit_day:

lm(bmi ~ fruit_day, data = train_c4im)
lm(bmi ~ poly(fruit_day, 2), data = train_c4im)
lm(bmi ~ poly(fruit_day, 3), data = train_c4im)

Polynomial Regression

A polynomial in the variable x of degree D is a linear combination of the powers of x up to D.

For example:

  • Linear: \(y = \beta_0 + \beta_1 x\)
  • Quadratic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2\)
  • Cubic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
  • Quartic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\)
  • Quintic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5\)

Fitting such a model creates a polynomial regression.
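
As a quick illustration (with invented simulated data, not today’s), lm() recovers a quadratic relationship once we supply the squared term:

```r
# Fitting a quadratic polynomial regression to simulated data.
# The data-generating coefficients (2, 1.5, -0.3) are invented for illustration.
set.seed(432)
x <- runif(300, 0, 5)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(300, 0, 0.4)

fit_quad <- lm(y ~ x + I(x^2))
round(coef(fit_quad), 2)   # estimates should land near 2, 1.5, -0.3
```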

Raw Polynomials vs. Orthogonal Polynomials

Predict bmi using fruit_day with a polynomial of degree 2.

(temp1 <- lm(bmi ~ fruit_day + I(fruit_day^2), 
             data = train_c4im))

Call:
lm(formula = bmi ~ fruit_day + I(fruit_day^2), data = train_c4im)

Coefficients:
   (Intercept)       fruit_day  I(fruit_day^2)  
       29.5866         -1.2726          0.1052  

This uses raw polynomials. Predicted bmi for fruit_day = 2 is

bmi = 29.5866 - 1.2726 (fruit_day) + 0.1052 (fruit_day^2)
    = 29.5866 - 1.2726 (2) + 0.1052 (4) 
    = 27.462

Does the raw polynomial match our expectations?

temp1 <- lm(bmi ~ fruit_day + I(fruit_day^2), 
             data = train_c4im)
augment(temp1, newdata = tibble(fruit_day = 2)) |> 
  gt() |> tab_options(table.font.size = 20)
fruit_day .fitted
2 27.46205

and this matches our “by hand” calculation. But R’s poly() function fits orthogonal, rather than raw, polynomials by default…

Fitting an Orthogonal Polynomial

Predict bmi using fruit_day with an orthogonal polynomial of degree 2.

(temp2 <- lm(bmi ~ poly(fruit_day,2), data = train_c4im))

Call:
lm(formula = bmi ~ poly(fruit_day, 2), data = train_c4im)

Coefficients:
        (Intercept)  poly(fruit_day, 2)1  poly(fruit_day, 2)2  
             28.084              -21.613                8.011  

This looks very different from our previous version of the model.

  • What happens when we make a prediction, though?

Prediction in the Orthogonal Polynomial Model

Remember that in our raw polynomial model, our “by hand” and “using R” calculations both concluded that the predicted bmi for a subject with fruit_day = 2 was 27.462.

Now, what happens with the orthogonal polynomial model temp2 we just fit?

augment(temp2, newdata = data.frame(fruit_day = 2)) |> 
  gt() |> tab_options(table.font.size = 20)
fruit_day .fitted
2 27.46205
  • No change in the prediction.

Fits of raw vs orthogonal polynomials

  • The two models are, in fact, identical.
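
We can verify that equivalence directly on simulated data (an invented example): the raw and orthogonal parameterizations span the same column space, so their fitted values must agree.

```r
# Raw and orthogonal quadratic fits produce identical fitted values.
set.seed(432)
x <- runif(100, 0, 4)    # invented stand-in for a predictor like fruit_day
y <- 30 - 1.3 * x + 0.1 * x^2 + rnorm(100, 0, 5)

fit_raw  <- lm(y ~ x + I(x^2))    # raw polynomial
fit_orth <- lm(y ~ poly(x, 2))    # orthogonal polynomial

all.equal(fitted(fit_raw), fitted(fit_orth))   # TRUE
```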

Why do we use orthogonal polynomials?

  • The main reason is to avoid having to include powers of our predictor that are highly collinear.
  • Variance Inflation Factor assesses collinearity…
vif(temp1)        ## from rms package
     fruit_day I(fruit_day^2) 
      4.652178       4.652178 
  • Orthogonal polynomial terms are uncorrelated with one another, easing the process of identifying which terms add value to our model.
vif(temp2)      
poly(fruit_day, 2)1 poly(fruit_day, 2)2 
                  1                   1 
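
The collinearity of raw powers, and its absence among poly()’s columns, is easy to demonstrate (with an invented x standing in for fruit_day):

```r
# Raw powers of a positive predictor are strongly correlated;
# orthogonal polynomial columns are uncorrelated by construction.
set.seed(432)
x <- runif(500, 0.1, 4)    # invented stand-in for fruit_day

cor(x, x^2)                # strongly positive, near 1

P <- poly(x, 2)
cor(P[, 1], P[, 2])        # essentially zero
```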

Why orthogonal rather than raw polynomials?

The tradeoff is that the raw polynomial is much easier to explain, since it yields a single interpretable equation.

In practice, we’ll usually avoid polynomials and instead use splines, which are more flexible and require less maintenance, at the cost of essentially requiring you to visualize their predictions rather than interpret their equations.

Adding a Second Order Polynomial to our Models

m_3 <- lm(bmi ~ poly(fruit_day,2) + exerany + health,
          data = train_c4im)
  • Comparison to other models without the interaction…
mod r.sq adj.r.sq sigma df df.res AIC BIC
m_1 0.08890746 0.08204682 6.115123 5 664 4335.776 4367.327
m_2 0.09790484 0.08974108 6.089441 6 663 4331.126 4367.184
m_3 0.09794577 0.08840743 6.093900 7 662 4333.096 4373.661

Tidied summary of m_3 coefficients

term est se t p conf.low conf.high
(Intercept) 27.8644132 0.7424070 37.5325307 4.410943e-166 26.6415511 29.087275
poly(fruit_day, 2)1 -15.8554311 6.1642322 -2.5721664 1.032334e-02 -26.0088995 -5.701963
poly(fruit_day, 2)2 1.0812412 6.2385699 0.1733156 8.624564e-01 -9.1946732 11.357156
exerany -2.0319807 0.5587135 -3.6368924 2.973984e-04 -2.9522704 -1.111691
healthVG 0.5606014 0.7005517 0.8002285 4.238655e-01 -0.5933183 1.714521
healthG 3.0024230 0.7220108 4.1584183 3.626319e-05 1.8131567 4.191689
healthF 3.5528349 0.9060797 3.9211064 9.733049e-05 2.0603779 5.045292
healthP 4.5325003 1.3636377 3.3238304 9.368376e-04 2.2863728 6.778628

m_3 Residuals

par(mfrow = c(1,2)); plot(m_3, which = c(1,2))

m_3 Residuals

par(mfrow = c(1,2)); plot(m_3, which = c(3,5))

Add in the interaction

m_3int <- lm(bmi ~ poly(fruit_day,2) + exerany * health,
          data = train_c4im)
  • Comparison to other models with the interaction…
mod r.sq adj.r.sq sigma df df.res AIC BIC
m_1int 0.1263202 0.1144064 6.006371 9 660 4315.682 4365.262
m_2int 0.1375041 0.1244162 5.972330 10 659 4309.050 4363.137
m_3int 0.1375923 0.1231751 5.976561 11 658 4310.982 4369.576

Tidied summary of m_3int coefficients

term est se t p conf.low conf.high
(Intercept) 27.3965514 1.410188 19.4275830 8.553086e-67 25.07372766 29.719375
poly(fruit_day, 2)1 -17.6962306 6.060147 -2.9200991 3.618914e-03 -27.67833946 -7.714122
poly(fruit_day, 2)2 -1.6306343 6.287045 -0.2593642 7.954354e-01 -11.98648356 8.725215
exerany -1.4683718 1.545305 -0.9502147 3.423520e-01 -4.01375639 1.077013
healthVG -0.6579448 1.626683 -0.4044702 6.859984e-01 -3.33737270 2.021483
healthG 2.7455162 1.612864 1.7022610 8.917878e-02 0.08884996 5.402182
healthF 7.5801726 1.771501 4.2789559 2.156689e-05 4.66220539 10.498140
healthP 9.2235030 2.607003 3.5379723 4.315760e-04 4.92931949 13.517686
exerany:healthVG 1.5963008 1.794788 0.8894090 3.741084e-01 -1.36002531 4.552627
exerany:healthG 0.4332807 1.796067 0.2412387 8.094453e-01 -2.52515103 3.391713
exerany:healthF -6.3985866 2.064042 -3.1000268 2.017633e-03 -9.79842072 -2.998752
exerany:healthP -6.6523428 3.052872 -2.1790438 2.968244e-02 -11.68095081 -1.623735

m_3int Residuals

par(mfrow = c(1,2)); plot(m_3int, which = c(1,2))

m_3int Residuals

par(mfrow = c(1,2)); plot(m_3int, which = c(3,5))

How do models m_3 and m_3int do in testing?

m3_test_aug <- augment(m_3, newdata = test_c4im)
m3int_test_aug <- augment(m_3int, newdata = test_c4im)

testing_r2 <- bind_rows(
    rsq(m1_test_aug, truth = bmi, estimate = .fitted),
    rsq(m2_test_aug, truth = bmi, estimate = .fitted),
    rsq(m3_test_aug, truth = bmi, estimate = .fitted),
    rsq(m1int_test_aug, truth = bmi, estimate = .fitted),
    rsq(m2int_test_aug, truth = bmi, estimate = .fitted),
    rsq(m3int_test_aug, truth = bmi, estimate = .fitted)) |>
    mutate(model = c("m_1", "m_2", "m_3", "m_1int",
                     "m_2int", "m_3int"))
  • I’ve hidden my calculations for RMSE and MAE here.

Results comparing all six models (testing)

bind_cols(testing_r2 |> select(model, rsquare = .estimate), 
          testing_rmse |> select(rmse = .estimate),
          testing_mae |> select(mae = .estimate)) |> 
  gt() |> tab_options(table.font.size = 20)
model rsquare rmse mae
m_1 0.07161065 5.728520 4.428381
m_2 0.06517957 5.769484 4.476413
m_3 0.06555978 5.767872 4.476380
m_1int 0.03973444 6.025257 4.623948
m_2int 0.03635748 6.082435 4.706638
m_3int 0.03572974 6.089895 4.707316
  • Did the polynomial term in m_3 and m_3int improve our predictions?

Splines

  • A linear spline is a continuous function formed by connecting points (called knots of the spline) by line segments.
  • A restricted cubic spline is a way to build highly complicated curves into a regression equation in a fairly easily structured way.
  • A restricted cubic spline is a series of polynomial functions joined together at the knots.
    • Such a spline gives us a way to flexibly account for non-linearity without over-fitting the model.
    • Restricted cubic splines can fit many different types of non-linearities.
    • Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.

The most common choices are 3, 4, or 5 knots.

  • 3 Knots, 2 degrees of freedom, allows the curve to “bend” once.
  • 4 Knots, 3 degrees of freedom, lets the curve “bend” twice.
  • 5 Knots, 4 degrees of freedom, lets the curve “bend” three times.
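
If you’re curious what rcs() computes, here is a sketch of one nonlinear basis term, following Harrell’s restricted cubic spline formulation (the scaling by the squared knot range is one common convention; rms::rcs handles such details for you). The defining property is that each term is linear beyond the outermost knots.

```r
# One nonlinear basis term of a restricted cubic spline (Harrell's
# formulation, sketched by hand; rms::rcs does this and more for you).
rcs_term <- function(x, knots, j) {
  k   <- length(knots)
  cub <- function(u) pmax(u, 0)^3    # truncated cubic
  ( cub(x - knots[j]) -
    cub(x - knots[k - 1]) * (knots[k] - knots[j])     / (knots[k] - knots[k - 1]) +
    cub(x - knots[k])     * (knots[k - 1] - knots[j]) / (knots[k] - knots[k - 1])
  ) / (knots[k] - knots[1])^2
}

knots <- c(1, 2, 3, 4)                    # hypothetical knot locations
far   <- rcs_term(c(10, 11, 12), knots, j = 1)
diff(far, differences = 2)                # ~0: linear beyond the last knot
```

With 4 knots, rcs(x, 4) supplies x itself plus two such nonlinear terms, giving the 3 degrees of freedom noted above.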

A simulated data set

set.seed(4322021)

sim_data <- tibble(
    x = runif(250, min = 10, max = 50),
    y = 3*(x-30) - 0.3*(x-30)^2 + 0.05*(x-30)^3 + 
        rnorm(250, mean = 500, sd = 70)
)

head(sim_data, 2)
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1  42.5  397.
2  35.9  414.

The sim_data, plotted.

Fitting Restricted Cubic Splines with lm and rcs

sim_linear <- lm(y ~ x, data = sim_data)
sim_poly2  <- lm(y ~ poly(x, 2), data = sim_data)
sim_poly3  <- lm(y ~ poly(x, 3), data = sim_data)
sim_rcs3   <- lm(y ~ rcs(x, 3), data = sim_data)
sim_rcs4   <- lm(y ~ rcs(x, 4), data = sim_data)
sim_rcs5   <- lm(y ~ rcs(x, 5), data = sim_data)

Looking at the Polynomial Fits

Looking at the Restricted Cubic Spline Fits

Fitting Restricted Cubic Splines with lm and rcs

For most applications, three to five knots strike a nice balance between complicating the model needlessly and fitting data pleasingly. Let’s consider a restricted cubic spline model for bmi based on fruit_day again, but now with:

  • in temp3, 3 knots, and
  • in temp4, 4 knots,
temp3 <- lm(bmi ~ rcs(fruit_day, 3), data = train_c4im)
temp4 <- lm(bmi ~ rcs(fruit_day, 4), data = train_c4im)

Spline models for bmi and fruit_day

Let’s try an RCS with 4 knots

m_4 <- lm(bmi ~ rcs(fruit_day, 4) + exerany + health,
          data = train_c4im)

m_4int <- lm(bmi ~ rcs(fruit_day, 4) + exerany * health,
          data = train_c4im)

Comparing 4 models including the exerany*health interaction…

mod fruit r.sq adj.r.sq sigma df AIC BIC
m_1int not in 0.1263202 0.1144064 6.006371 9 4315.682 4365.262
m_2int linear 0.1375041 0.1244162 5.972330 10 4309.050 4363.137
m_3int poly(2) 0.1375923 0.1231751 5.976561 11 4310.982 4369.576
m_4int rcs(4) 0.1378531 0.1221061 5.980203 12 4312.779 4375.881

Tidied summary of m_4int coefficients

term est se t p lo90 hi90
(Intercept) 28.5634270 1.543775 18.5023217 7.751118e-62 26.0205571 31.1062969
rcs(fruit_day, 4)fruit_day -1.2119249 1.265428 -0.9577193 3.385566e-01 -3.2963080 0.8724582
rcs(fruit_day, 4)fruit_day' 2.3145244 5.822397 0.3975209 6.911125e-01 -7.2759887 11.9050376
rcs(fruit_day, 4)fruit_day'' -5.9458573 16.583638 -0.3585376 7.200562e-01 -33.2620308 21.3703162
exerany -1.3564307 1.553859 -0.8729432 3.830130e-01 -3.9159102 1.2030489
healthVG -0.6374547 1.628132 -0.3915251 6.955361e-01 -3.3192756 2.0443661
healthG 2.7744698 1.614570 1.7183953 8.619565e-02 0.1149883 5.4339514
healthF 7.6381022 1.774779 4.3036922 1.935601e-05 4.7147286 10.5614758
healthP 9.0463580 2.564614 3.5273755 4.489322e-04 4.8219862 13.2707299
exerany:healthVG 1.5709516 1.796510 0.8744463 3.821948e-01 -1.3882172 4.5301205
exerany:healthG 0.3723381 1.799358 0.2069283 8.361299e-01 -2.5915211 3.3361972
exerany:healthF -6.4655738 2.067528 -3.1272008 1.842739e-03 -9.8711559 -3.0599917
exerany:healthP -6.4950803 3.022963 -2.1485805 3.203248e-02 -11.4744338 -1.5157268

m_4int Residual Plots

par(mfrow = c(1,2)); plot(m_4int, which = c(1,2))

m_4int Residual Plots

par(mfrow = c(1,2)); plot(m_4int, which = c(3,5))

How do models m_4 and m_4int do in testing?

model rsquare rmse mae
m_1 0.07161065 5.728520 4.428381
m_2 0.06517957 5.769484 4.476413
m_3 0.06555978 5.767872 4.476380
m_4 0.06688129 5.763084 4.470968
m_1int 0.03973444 6.025257 4.623948
m_2int 0.03635748 6.082435 4.706638
m_3int 0.03572974 6.089895 4.707316
m_4int 0.03711199 6.076300 4.698624

I’ll note that there’s a fair amount of very repetitive code in the Quarto file to create that table.

  • What are our conclusions?

Next Week

  • Using the ols function from the rms package to fit linear regression models with non-linear terms.
  • Be sure to submit Lab 2 to Canvas by Tuesday 2024-01-30 at Noon.

Appendix: How The class4 and class4im data were built from the smart_ohio.csv data created in the Course Notes

Creating Today’s Data Set

url1 <- "https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/smart_ohio.csv"

smart_ohio <- read_csv(url1)

class4 <- smart_ohio |>
    filter(hx_diabetes == 0, 
           mmsa == "Cleveland-Elyria",
           complete.cases(bmi)) |>
    select(bmi, inc_imp, fruit_day, drinks_wk, 
           female, exerany, genhealth, race_eth, 
           hx_diabetes, mmsa, SEQNO) |>            
    type.convert(as.is = FALSE) |>                       
    mutate(ID = as.character(SEQNO - 2017000000)) |>
    relocate(ID)

Codebook for useful class4 variables

  • 894 subjects in Cleveland-Elyria with bmi and no history of diabetes
Variable Description
bmi (outcome) Body-Mass index in kg/m2.
inc_imp income (imputed from grouped values) in $
fruit_day average fruit servings consumed per day
drinks_wk average alcoholic drinks consumed per week
female sex: 1 = female, 0 = male
exerany any exercise in the past month: 1 = yes, 0 = no
genhealth self-reported overall health (5 levels)
race_eth race and Hispanic/Latinx ethnicity (5 levels)
  • plus ID, SEQNO, hx_diabetes (all 0), MMSA (all Cleveland-Elyria)
  • See Course Notes Chapter on BRFSS SMART data for variable details

Basic Data Summaries

Available approaches include:

  • summary
  • mosaic package’s inspect()
  • Hmisc package’s describe

all of which can work nicely in an HTML presentation, but none of them fit well on one of these slides.

Quick Histogram of each quantitative variable

Code for previous slide

p1 <- ggplot(class4, aes(x = bmi)) + 
    geom_histogram(fill = "navy", col = "white", bins = 20)
p2 <- ggplot(class4, aes(x = inc_imp)) + 
    geom_histogram(fill = "forestgreen", col = "white", 
                   bins = 20)
p3 <- ggplot(class4, aes(x = fruit_day)) + 
    geom_histogram(fill = "tomato", col = "white", bins = 20)
p4 <- ggplot(class4, aes(x = drinks_wk)) + 
    geom_histogram(fill = "royalblue", col = "white", 
                   bins = 20)
(p1 + p2) / (p3 + p4)

I also used #| warning: false as a code chunk option for the plot to avoid warnings about missing values, like this one for inc_imp:

Warning: Removed 120 rows containing non-finite values
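An alternative to suppressing the warning is to drop the missing values explicitly before plotting; here is a sketch for the inc_imp histogram:

```r
## Filter out missing inc_imp first, so ggplot has nothing to warn about
class4 |> 
    filter(!is.na(inc_imp)) |>
    ggplot(aes(x = inc_imp)) + 
    geom_histogram(fill = "forestgreen", col = "white", bins = 20)
```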

Binary variables in raw class4

class4 |> tabyl(female, exerany) |> adorn_title()
        exerany        
 female       0   1 NA_
      0      95 268  20
      1     128 361  22
  • female is based on biological sex (1 = female, 0 = male)
  • exerany comes from a response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” (1 = yes, 0 = no, don’t know and refused = missing)
  • Any signs of trouble here?
  • I think the 1/0 values and names are OK choices.

Multicategorical genhealth in raw class4

class4 |> tabyl(genhealth)
   genhealth   n     percent valid_percent
 1_Excellent 148 0.165548098    0.16573348
  2_VeryGood 324 0.362416107    0.36282195
      3_Good 274 0.306487696    0.30683091
      4_Fair 112 0.125279642    0.12541993
      5_Poor  35 0.039149888    0.03919373
        <NA>   1 0.001118568            NA
  • The variable is based on “Would you say that in general your health is …” using the five specified categories (Excellent -> Poor), numbered for convenience after data collection.
  • Don’t know / not sure / refused were each treated as missing.
  • How might we manage this variable?

Changing the levels for genhealth

class4 <- class4 |>
    mutate(health = 
               fct_recode(genhealth,
                          E = "1_Excellent",
                          VG = "2_VeryGood",
                          G = "3_Good",
                          F = "4_Fair",
                          P = "5_Poor"))

Might want to run a sanity check here, just to be sure…

Checking health vs. genhealth in class4

class4 |> tabyl(genhealth, health) |> adorn_title()
             health                   
   genhealth      E  VG   G   F  P NA_
 1_Excellent    148   0   0   0  0   0
  2_VeryGood      0 324   0   0  0   0
      3_Good      0   0 274   0  0   0
      4_Fair      0   0   0 112  0   0
      5_Poor      0   0   0   0 35   0
        <NA>      0   0   0   0  0   1
  • OK. We’ve preserved the order and we have much shorter labels. Sometimes, that’s helpful.

Multicategorical race_eth in raw class4

class4 |> count(race_eth)
# A tibble: 6 × 2
  race_eth                     n
  <fct>                    <int>
1 Black non-Hispanic         167
2 Hispanic                    27
3 Multiracial non-Hispanic    19
4 Other race non-Hispanic     22
5 White non-Hispanic         646
6 <NA>                        13

“Don’t know”, “Not sure”, and “Refused” were treated as missing.

  • What is this variable actually about?
  • What is the most common thing people do here?

What is the question you are asking?

Collapsing race_eth levels might be rational for some questions.

  • We have lots of data in two of the categories, but only in those two.
  • Systemic racism affects people of color in different ways across these categories, but also within them.
  • Is combining race and Hispanic/Latinx ethnicity helpful?

It’s hard to see the justice in collecting this information and not using it in as granular a form as possible, though this leaves some small sample sizes. There is no magic number for “too small a sample size.”

  • Most people identified themselves in one of the categories.
  • These data are not ordered, and (I’d argue) ordering them isn’t helpful.
  • Regression models are easier to interpret, though, if the “baseline” category is a common one.

Resorting the factor for race_eth

Let’s sort all five levels, from most observations to least…

class4 <- class4 |>
    mutate(race_eth = fct_infreq(race_eth))
class4 |> tabyl(race_eth)
                 race_eth   n    percent valid_percent
       White non-Hispanic 646 0.72259508    0.73325766
       Black non-Hispanic 167 0.18680089    0.18955732
                 Hispanic  27 0.03020134    0.03064699
  Other race non-Hispanic  22 0.02460850    0.02497162
 Multiracial non-Hispanic  19 0.02125280    0.02156640
                     <NA>  13 0.01454139            NA
  • Not a perfect solution, certainly, but we’ll try it out.
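If we did want to collapse the sparser levels (which we are not doing today), forcats provides fct_lump_n(); a sketch:

```r
## Alternative (not adopted here): keep the 2 most common levels,
## lump the remaining levels into "Other"
class4 |> 
    mutate(race_3 = fct_lump_n(race_eth, n = 2)) |>
    tabyl(race_3)
```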

“Cleaned” Data and Missing Values

class4 <- class4 |>
    select(ID, bmi, inc_imp, fruit_day, drinks_wk, 
           female, exerany, health, race_eth, everything())

miss_var_summary(class4)
# A tibble: 13 × 3
   variable    n_miss pct_miss
   <chr>        <int>    <dbl>
 1 inc_imp        120   13.4  
 2 exerany         42    4.70 
 3 fruit_day       41    4.59 
 4 drinks_wk       39    4.36 
 5 race_eth        13    1.45 
 6 health           1    0.112
 7 genhealth        1    0.112
 8 ID               0    0    
 9 bmi              0    0    
10 female           0    0    
11 hx_diabetes      0    0    
12 mmsa             0    0    
13 SEQNO            0    0    
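The naniar package can also display this missingness summary graphically; a one-line sketch:

```r
## Visual version of miss_var_summary(): missing counts per variable
gg_miss_var(class4)
```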

Single Imputation Approach?

set.seed(43203)
class4im <- class4 |>
    select(ID, bmi, inc_imp, fruit_day, drinks_wk, 
           female, exerany, health, race_eth) |>
    data.frame() |>                     ## simputation wants a data.frame
    impute_cart(health ~ bmi + female) |>           ## classification tree
    impute_pmm(exerany ~ female + health + bmi) |>  ## predictive mean matching
    impute_rlm(inc_imp + drinks_wk + fruit_day ~ 
                   bmi + female + health + exerany) |>  ## robust linear models
    impute_cart(race_eth ~ health + inc_imp + bmi) |>
    tibble()

prop_miss_case(class4im)
[1] 0
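It is worth checking that imputation hasn't badly distorted the distributions; here is a sketch comparing inc_imp before and after imputation, using mosaic's favstats():

```r
## Compare inc_imp summaries: class4 (120 missing) vs. class4im (none)
favstats(~ inc_imp, data = class4)
favstats(~ inc_imp, data = class4im)
```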

Saving the tidied data

Let’s save both the unimputed and the imputed tidy data as R data sets.

write_rds(class4, "c04/data/class4.Rds")

write_rds(class4im, "c04/data/class4im.Rds")

To reload these files, we’ll use read_rds().

  • The main advantage here is that we’ve saved the whole R object, including all characteristics that we’ve added since the original download.